mtmd: be able to use alternative types for the K*Q multiplication #1567
Merged
Conversation
... and congrats on 1000 PRs closed! I guess that these DO matter and prove your journey right, not the Git stars! 🙏🥃
I thought I should give some attention to the multi-modality stuff. The initial idea was to enable flash attention (FA), but that turned out to be too big a change, as multi-modal models like to use unusual attention head sizes. While looking into this I noticed that a very large fraction of the image encoding time is spent in the K*Q matrix multiplication, so I decided to see if that could be made somewhat faster.

When not using FA, the K*Q matrix multiplication is done using 32-bit floats. An obvious thing to try is down-casting to f16/bf16, or perhaps even to Q8_0, to see whether that brings a performance benefit. Hence, this PR adds the ability to define the type used for the K*Q matrix multiplication via a command-line argument.

Somewhat surprisingly, I only see a performance improvement when running CPU-only on a Zen4 CPU (Ryzen-7950X) and using --mtmd-kq-type bf16. In that case, for a 1 MiB image, which generates 4015 image tokens, encoding time is reduced from 76 seconds to 65 seconds. (I thought that was much too long, so I tested the same image with today's llama.cpp; it needed ~300 seconds to encode the same image on the same CPU.)

I also played with converting to Q8_0. That seems to work just fine in terms of the generated response, but does not give a performance benefit. I guess part of the issue is that the Qwen3 vision encoder has a head size of 72, so to use Q8_0 one must pad K and Q to a row size of 96, which a) takes time and b) makes the matrix multiplication 78% larger.